Brain Stroke prediction - Homework 3

Introduction

I'm going to explore brain stroke factors and their relevance using Local Interpretable Model-agnostic Explanations (LIME).

Part 3

Compare LIME for various observations in the dataset. How stable are these explanations?

For the first observation, the results don't vary much with the seed. work_type_children and then age are the most important factors. The five most positively correlated factors are unchanged and appear in the same order of relevance. The differences appear in the less relevant factors. For example, heart_disease has an absolute value of ~0.045 with seed 1, ~0.08 with seed 2, and less than ~0.02 with seed 0.
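The seed comparison above can be made a little more systematic. The sketch below is a minimal illustration, not the code used for this homework: it assumes each LIME explanation has already been converted to a plain feature-to-weight dictionary (for example from the (feature, weight) pairs that LIME's `as_list()` returns). The helper names `top_k` and `seed_stability` are hypothetical. Only the heart_disease magnitudes are the approximate values quoted above; the work_type_children and age numbers are placeholders.

```python
def top_k(weights, k):
    """Return the k features with the largest absolute weight, most important first."""
    ordered = sorted(weights.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return [feature for feature, _ in ordered[:k]]


def seed_stability(explanations, k=2):
    """Compare explanations of one observation obtained with different seeds.

    explanations maps seed -> {feature: weight}.
    Returns (same_top_k, spread), where same_top_k says whether every seed gives
    the same top-k ranking, and spread maps each feature to the difference
    between its largest and smallest absolute weight across seeds.
    """
    rankings = [top_k(weights, k) for weights in explanations.values()]
    same_top_k = all(ranking == rankings[0] for ranking in rankings)

    features = set()
    for weights in explanations.values():
        features.update(weights)

    spread = {}
    for feature in features:
        magnitudes = [abs(w.get(feature, 0.0)) for w in explanations.values()]
        spread[feature] = max(magnitudes) - min(magnitudes)
    return same_top_k, spread


# Illustrative absolute weights. heart_disease uses the approximate values
# quoted in the text; work_type_children and age are placeholders.
explanations = {
    0: {"work_type_children": 0.30, "age": 0.20, "heart_disease": 0.02},
    1: {"work_type_children": 0.31, "age": 0.19, "heart_disease": 0.045},
    2: {"work_type_children": 0.29, "age": 0.21, "heart_disease": 0.08},
}

same_top_k, spread = seed_stability(explanations, k=2)
print("Top-2 ranking identical across seeds:", same_top_k)
print("Spread of |weight| per feature:", spread)
```

A stable top-k ranking together with a spread that is small relative to the leading weights matches the observation that only the less relevant factors move between seeds.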

Similarly, for this observation the factors seem stable across seeds. The values also seem consistent across observations: work_type_children <= 0 and the other factors have similar effects on the final result regardless of which observation is explained. The main difference between this and the previous observation is the person's age: the output prediction differs because the two people are of different ages.

Part 4

For every feature except the age of person 0, the two packages agree on whether the correlation is positive or negative. They differ in the order of importance. They also assign different weights to individual attributes; for example, LIME puts more importance on work_type_children, whereas SHAP considers it less important than many other factors.
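The two checks described above (sign agreement and order of importance) can be sketched as a small helper. This is an illustration under assumptions, not the original analysis: both explanations are assumed to be available as feature-to-value dictionaries, the function names are hypothetical, and all numbers are placeholders chosen only to mimic the pattern described for person 0.

```python
def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def rank_by_magnitude(attributions):
    """Map each feature to its rank, where 1 is the largest absolute attribution."""
    ordered = sorted(attributions, key=lambda f: abs(attributions[f]), reverse=True)
    return {feature: position + 1 for position, feature in enumerate(ordered)}


def compare_explanations(a, b):
    """Compare two {feature: attribution} dicts on the features they share.

    Returns (sign_mismatch, rank_shift): the features whose signs differ, and
    for every shared feature its rank in b minus its rank in a.
    """
    shared = sorted(set(a) & set(b))
    sign_mismatch = [f for f in shared if sign(a[f]) != sign(b[f])]
    rank_a = rank_by_magnitude({f: a[f] for f in shared})
    rank_b = rank_by_magnitude({f: b[f] for f in shared})
    rank_shift = {f: rank_b[f] - rank_a[f] for f in shared}
    return sign_mismatch, rank_shift


# Placeholder values, not results from the homework.
lime_weights = {"work_type_children": 0.30, "age": -0.20, "bmi": 0.05, "heart_disease": 0.03}
shap_values = {"work_type_children": 0.02, "age": 0.25, "bmi": 0.06, "heart_disease": 0.04}

sign_mismatch, rank_shift = compare_explanations(lime_weights, shap_values)
print("Features with opposite sign:", sign_mismatch)
print("Rank shift (SHAP rank - LIME rank):", rank_shift)
```

A positive rank shift means the feature is ranked lower by the second explanation than by the first, which is the situation described for work_type_children.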

Part 5

With XGB, all attributes have much smaller values. XGB also tends to put more focus on bmi, which logistic regression tends to treat as less relevant than other features. On the other hand, work_type_children is important for logistic regression and of low relevance for XGB.
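Because the XGB values are much smaller overall, comparing raw magnitudes across the two models can be misleading. One possible way to compare where each model puts its focus is to rescale each explanation so that the absolute attributions sum to 1. The sketch below assumes this normalization (it was not part of the original analysis), uses a hypothetical helper name, and uses placeholder numbers that only mimic the pattern described above.

```python
def relative_importance(attributions):
    """Rescale absolute attributions so that they sum to 1 within one explanation."""
    total = sum(abs(value) for value in attributions.values())
    if total == 0:
        # Nothing to rescale: every feature has zero attribution.
        return {feature: 0.0 for feature in attributions}
    return {feature: abs(value) / total for feature, value in attributions.items()}


# Placeholder values, not results from the homework.
logreg_attr = {"age": 0.40, "work_type_children": 0.30, "bmi": 0.05}
xgb_attr = {"age": 0.04, "work_type_children": 0.005, "bmi": 0.03}

logreg_rel = relative_importance(logreg_attr)
xgb_rel = relative_importance(xgb_attr)

for feature in logreg_attr:
    print(f"{feature}: logreg share {logreg_rel[feature]:.2f}, xgb share {xgb_rel[feature]:.2f}")
```

After rescaling, the shares are comparable even though the raw XGB values are roughly ten times smaller in this placeholder example.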

For person 2, the two models give opposite signs for private work, and similarly for a government job for person 0. However, these values are so close to 0 that I'd assume their impact is negligible. Both models seem to agree that for elderly people age is the most important factor.